Search for: All records

Creators/Authors contains: "Dey, Tapajit"

Representation of Developer Expertise in Open Source Software

https://doi.org/10.1109/ICSE43902.2021.00094

Dey, Tapajit; Karnauch, Andrey; Mockus, Audris (June 2021, 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE))
Full Text Available
Effect of technical and social factors on pull request quality for the npm ecosystem

Dey, Tapajit; Mockus, Audris (October 2020, Empirical Software Engineering)

Background: Pull Request (PR) integrators often face challenges in terms of multiple concurrent PRs, so the ability to gauge which of the PRs will get accepted can help them balance their workload. PR creators would benefit from knowing if certain characteristics of their PRs may increase the chances of acceptance. Aim: We modeled the probability that a PR will be accepted within a month after creation using a Random Forest model utilizing 50 predictors representing properties of the author, the PR, and the project to which the PR is submitted. Method: 483,988 PRs from 4218 popular NPM packages were analysed and we selected a subset of 14 predictors sufficient for a tuned Random Forest model to reach high accuracy. Result: An AUC-ROC value of 0.95 was achieved predicting PR acceptance. The model excluding PR properties that change after submission gave an AUC-ROC value of 0.89. We tested the utility of our model in practical scenarios by training it with historical data for the NPM package bootstrap and predicting if the PRs submitted in the future will be accepted. This gave us an AUC-ROC value of 0.94 with all 14 predictors, and 0.77 excluding PR properties that change after its creation. Conclusion: PR integrators can use our model for a highly accurate assessment of the quality of the open PRs and PR creators may benefit from the model by understanding which characteristics of their PRs may be undesirable from the integrators' perspective. The model can be implemented as a tool, which we plan to do as future work.
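The abstract above reports model quality as AUC-ROC. As a reading aid, here is a minimal sketch of what that number measures: the probability that a randomly chosen accepted PR receives a higher score than a randomly chosen rejected one. The function name `auc_roc` and the toy labels and scores are assumptions made for illustration; they are not the authors' code or data.

```python
from itertools import product

def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = PR accepted, 0 = PR not accepted.
accepted = [1, 1, 0, 1, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(auc_roc(accepted, predicted))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise form is quadratic in the number of examples, so it suits small illustrations only.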
Full Text Available
The Secret Life of Hackathon Code Where does it come from and where does it go?

https://doi.org/10.1109/MSR52588.2021.00020

Imam, Ahmed; Dey, Tapajit; Nolte, Alexander; Mockus, Audris; Herbsleb, James D. (May 2021, Mining Software Repositories)
Background: Hackathons have become popular events for teams to collaborate on projects and develop software prototypes. Most existing research focuses on activities during an event with limited attention to the evolution of the code brought to or created during a hackathon. Aim: We aim to understand the evolution of hackathon-related code, specifically, how much hackathon teams rely on pre-existing code or how much new code they develop during a hackathon. Moreover, we aim to understand if and where that code gets reused, and what factors affect reuse. Method: We collected information about 22,183 hackathon projects from DEVPOST (a hackathon database) and obtained related code (blobs), authors, and project characteristics from the WORLD OF CODE. We investigated if code blobs in hackathon projects were created before, during, or after an event by identifying the original blob creation date and author, and also checked if the original author was a hackathon project member. We tracked code reuse by first identifying all commits containing blobs created during an event before determining all projects that contain those commits. Result: While only approximately 9.14% of the code blobs are created during hackathons, this amount is still significant considering time and member constraints of such events. Approximately a third of these code blobs get reused in other projects. The number of associated technologies and the number of participants in a project increase reuse probability. Conclusion: Our study demonstrates to what extent pre-existing code is used and new code is created during a hackathon and how much of it is reused elsewhere afterwards. Our findings help to better understand code reuse as a phenomenon and the role of hackathons in this context and can serve as a starting point for further studies in this area.
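The before/during/after classification of code blobs described above can be sketched as a comparison of each blob's original creation time against the event window. This is only an illustration: the function name `classify_blob`, the inclusive treatment of the event end, and all dates below are assumptions, not details taken from the study.

```python
from collections import Counter
from datetime import datetime

def classify_blob(created, event_start, event_end):
    """Label a blob by when it was first created relative to the event window."""
    if created < event_start:
        return "before"
    if created <= event_end:
        return "during"
    return "after"

# Hypothetical event window and blob creation times (not data from the study).
start = datetime(2019, 3, 1, 9, 0)
end = datetime(2019, 3, 3, 18, 0)
blob_times = [
    datetime(2018, 11, 5, 12, 0),
    datetime(2019, 3, 2, 14, 30),
    datetime(2019, 3, 2, 23, 10),
    datetime(2019, 4, 1, 8, 0),
]

counts = Counter(classify_blob(t, start, end) for t in blob_times)
print(counts)
```

The share of blobs labelled "during" is the kind of quantity the abstract reports as approximately 9.14%.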
Full Text Available
A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

https://doi.org/10.1145/3379597.3387500

Fry, Tanner; Dey, Tapajit; Karnauch, Andrey; Mockus, Audris (June 2020, IEEE International Working Conference on Mining Software Repositories)

The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues, specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs.
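The two-stage idea in the abstract above (first group potentially connected author IDs into blocks, then compare only within blocks) can be sketched as follows. Everything here is an assumption for illustration: the blocking key (lowercased last word of the name), the `Name <email>` ID format, and the string-similarity check, which merely stands in for the machine learning model used in the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def block_key(author_id):
    """Hypothetical blocking key: lowercased last word of the name part."""
    name = author_id.split("<")[0].strip().lower()
    return name.split()[-1] if name else ""

def candidate_pairs(author_ids):
    """Yield pairs of IDs that share a block, avoiding all-pairs comparison."""
    blocks = defaultdict(list)
    for author_id in author_ids:
        blocks[block_key(author_id)].append(author_id)
    for members in blocks.values():
        yield from combinations(members, 2)

def looks_like_alias(a, b, threshold=0.6):
    """Stand-in for a trained classifier: plain string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Hypothetical author ID strings.
ids = [
    "Jane Doe <jdoe@example.com>",
    "jane doe <jane.doe@example.org>",
    "John Smith <jsmith@example.com>",
]
pairs = list(candidate_pairs(ids))
aliases = [(a, b) for a, b in pairs if looks_like_alias(a, b)]
print(len(pairs), "candidate pair(s) instead of", len(ids) * (len(ids) - 1) // 2)
```

Blocking matters because the number of pairwise comparisons grows quadratically with the number of IDs, which the abstract names as the obstacle at the scale of 38M author IDs.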
Full Text Available
World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data

https://doi.org/10.1007/s10664-020-09905-9

Ma, Yuxing; Dey, Tapajit; Bogart, Chris; Amreen, Sadika; Valiev, Marat; Tutko, Adam; Kennard, David; Zaretzki, Russell; Mockus, Audris (March 2021, Empirical Software Engineering)
Full Text Available
Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem

Dey, Tapajit; Ma, Yuxing; Mockus, Audris (September 2019, In Proceedings of the 15th International Conference on Predictive Models and Data Analytics in Software Engineering)

Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against it. Open source components depend on each other directly and transitively, and evidence suggests that more effort is required for reporting and resolving the issues reported further upstream in this supply chain. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators are obtained and analyzed, and dependencies on NPM packages are identified. The fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives an AUC-ROC value of 0.68, and variables representing aggregate participation patterns proved to be the important predictors. Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggest the need to increase visibility further upstream.
Full Text Available
Detecting and Characterizing Bots that Commit Code

https://doi.org/10.1145/3379597.3387478

Dey, Tapajit; Mousavi, Sara; Ponce, Eduardo; Fry, Tanner; Vasilescu, Bogdan; Filippova, Anna; Mockus, Audris (June 2020, IEEE International Working Conference on Mining Software Repositories)

Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues, is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality, it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the commits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of which have more than 1000 commits) and 13,762,430 commits they created.
Full Text Available
Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem

https://doi.org/10.1145/3345629.3345634

Dey, Tapajit; Ma, Yuxing; Mockus, Audris (August 2019, PROMISE'19: Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering)

Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against it. Open source components depend on each other directly and transitively, and evidence suggests that more effort is required for reporting and resolving the issues reported further upstream in this supply chain. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators are obtained and analyzed, and dependencies on NPM packages are identified. The fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives an AUC-ROC value of 0.68, and variables representing aggregate participation patterns proved to be the important predictors. Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggest the need to increase visibility further upstream.
Full Text Available
A Methodology for Measuring FLOSS Ecosystems

https://doi.org/10.1007/978-981-13-7099-1_1

Amreen, Sadika; Bichescu, Bogdan; Bradley, Randy; Dey, Tapajit; Ma, Yuxing; Mockus, Audris; Mousavi, Sara; Zaretzki, Russell (July 2019, Towards Engineering Free/Libre Open Source Software (FLOSS) Ecosystems for Impact and Sustainability)

The FLOSS ecosystem as a whole is a critical component of the world’s computing infrastructure, yet it is not well understood. In order to understand it well, we need to measure it first. We, therefore, aim to provide a framework for measuring key aspects of the entire FLOSS ecosystem. We first consider the FLOSS ecosystem through the lens of a supply chain. A supply chain is a series of interconnected parties/affiliates, each contributing unique elements and expertise so as to ensure a final solution is accessible to all interested parties. This perspective has been extremely successful in helping companies cope with multifaceted risks caused by the distributed decision-making in their supply chains, especially as they have become more global. Software ecosystems, similarly, represent distributed decisions in supply chains of code and author contributions, suggesting that relationships among projects, developers, and source code have to be measured. We then describe a massive measurement infrastructure involving discovery, extraction, cleaning, correction, and augmentation of publicly available open-source data from version control systems and other sources. We then illustrate how the key relationships among the nodes representing developers, projects, changes, and files can be accurately measured, how to handle absence of measures for user base in version control data, and, finally, illustrate how such measurement infrastructure can be used to increase knowledge resilience in FLOSS.
Full Text Available
Modeling Relationship between Post-Release Faults and Usage in Mobile Software

https://doi.org/10.1145/3273934.3273941

Dey, Tapajit; Mockus, Audris (January 2018, PROMISE'18: Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering)

Background: The way post-release usage of software affects the number of faults experienced by users is scarcely explored due to the proprietary nature of such data. The commonly used quality measure of post-release faults may, therefore, reflect usage instead of the quality of the software development process. Aim: To determine how software faults and software use are related in a post-deployment scenario and, based on that, derive a post-deployment quality measure that reflects developers' performance more accurately. Method: We analyze Google Analytics data counting daily new users, visits, time-on-site, visits per user, and release start date and duration for 169 releases of a complex communication application for Android OS. We utilize Linear Regression, Bayesian Network, and Random Forest models to explain the interrelationships and to derive a release quality measure that is relatively stable with respect to variations in software usage. Results: We found the number of new users and release start date to be the determining factors for the number of exceptions, and found no direct link between the intensity and frequency of software usage and software faults. Furthermore, the relative increase in the number of crashes was found to be stably associated with a power of 1.3 relative increase in the number of new users. Based on the findings we propose a release quality measure: number of crashes per user for a release of the software, which was seen to be independent of any other usage variables, providing us with a usage-independent measure of software quality. Conclusions: We expect our result and our proposed quality measure will help gauge the release quality of software more accurately and inspire further research in this area.
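One way to read the reported power relation is that crashes scale roughly as new users raised to the power 1.3. Under that reading (an interpretation of the abstract, not a formula quoted from the paper), the expected relative change in crashes for a given relative change in new users can be sketched as below. The function name `crash_growth` is hypothetical.

```python
def crash_growth(new_user_growth, exponent=1.3):
    """Relative change in crashes implied by a relative change in new users,
    assuming crashes are proportional to new_users ** exponent."""
    if new_user_growth <= 0:
        raise ValueError("growth ratio must be positive")
    return new_user_growth ** exponent

# Example: a release that attracts twice as many new users as the previous one.
ratio = crash_growth(2.0)
print(f"crashes scale by a factor of about {ratio:.2f}")
```

Under this assumption, doubling new users would correspond to roughly 2.46 times as many crashes, which illustrates why a raw crash count reflects usage as well as quality.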
Full Text Available